Steps of data analysis:
- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code
P(A u B) = P(A) + P(B) - P(A n B): the probability of the union of events A and B is the sum of their probabilities minus the probability of their intersection (for mutually exclusive events the intersection is 0, so the union is simply the sum).
Random variable = the numerical outcome of an experiment (discrete or continuous)
- e.g. the flip of a coin (discrete random variable)
- e.g. the roll of a die (also a discrete random variable, because it is limited to 1-6)
- e.g. the BMI of a person (continuous random variable)
For random variables, the probabilities must add up to 1, and components are larger or equal to 0.
Probability mass function - for discrete random variables
Probability density function - for continuous random variables
in R, the probability of an instance occurring at or below some value within a continuous distribution can be calculated using pbeta() and its relatives. The prefix p returns a cumulative probability; the prefix q returns a quantile.
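R uses the same prefix convention across all of its distribution functions; a quick sketch with the normal distribution (the numeric values in the comments are approximate):

```r
# R's distribution-function prefixes, illustrated on the normal distribution:
# d = density, p = cumulative probability, q = quantile, r = random draws
dnorm(0)               # density at 0
pnorm(1.96)            # P(Z <= 1.96), roughly 0.975
qnorm(0.975)           # the inverse of pnorm, roughly 1.96
set.seed(1)
rnorm(3)               # three random standard-normal draws
qnorm(pnorm(1.5))      # p and q undo each other, returning 1.5
```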
Conditional probability: think of the probability of a lightning strike on a stormy day vs. a sunny day.
P(A|B) = P(A n B)/P(B) is the conditional probability, written as the probability of A given condition B: the probability of the intersection of A and B, divided by the probability of B.
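The conditional-probability formula can be checked by simulation. This small sketch uses a die roll, with A = the roll is even and B = the roll is greater than 3, so P(A|B) should be P({4,6})/P({4,5,6}) = 2/3:

```r
# Monte Carlo check of P(A|B) = P(A n B)/P(B) on a fair die
set.seed(42)
rolls <- sample(1:6, 1e5, replace = TRUE)
A <- rolls %% 2 == 0   # event A: roll is even
B <- rolls > 3         # event B: roll is greater than 3
mean(A & B) / mean(B)  # simulated P(A|B), close to 2/3
```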
mean = center of a distribution
variance and standard deviation = how spread out the distribution is
## Beta distribution
pbeta(0.75, 2, 1) ##the probability that a Beta(2, 1) random variable is at or below 0.75
## [1] 0.5625
pbeta(c(.4, .5, .6), 2, 1) ##the same cumulative probability for the same density, evaluated at 0.4, 0.5 and 0.6
## [1] 0.16 0.25 0.36
pnorm(70, mean = 80, sd = 10) ## probability of a value at or below 70, given a set mean and standard deviation. Data is normally distributed.
## [1] 0.1586553
qnorm(0.95, mean=1100, sd = 75) ## returns the value of a normal distribution at a specified percentile (95%), given a known mean and st.dev.
## [1] 1223.364
pnorm(16, mean = 15, sd = 1) - pnorm(14, mean = 15, sd = 1) ##probability that an output is between two known values, i.e. 14 and 16
## [1] 0.6826895
ppois(10, lambda = 15) ##Poisson-distributed data. Probability of seeing 10 or fewer instances, where lambda is 5 people per hour over 3 hours, hence 15.
## [1] 0.1184644
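A useful sanity check: the Poisson cumulative probability is just the sum of the individual point masses, so ppois() and a sum over dpois() agree exactly:

```r
# ppois(k, lambda) equals the sum of dpois(0..k, lambda)
ppois(10, lambda = 15)
sum(dpois(0:10, lambda = 15))  # same value, 0.1184644
```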
library(UsingR); data(galton) ##the galton dataset comes from UsingR
library(ggplot2)
g <- ggplot(galton, aes(x = child))
g <- g + geom_histogram(fill = "salmon",
binwidth=1, aes(y = ..density..), colour = "black")
g <- g + geom_density(size = 2)
g <- g + geom_vline(xintercept = mean(galton$child), size = 2)
g
lambda <- 0.2
nsim <- 1:1000 # Number of Simulations/rows
n <- 40
Ematrix <- data.frame(x = sapply(nsim, function(x) {mean(rexp(n, lambda))})) ##a data.frame of 1000 simulated means, each the mean of 40 exponential draws
head(Ematrix)
Smean <- apply(Ematrix, 2, mean) ## calculating the simulated mean from above matrix.
Smean
## x
## 5.00173
Tmean <- 1/lambda ##calculating the theoretical mean given lambda
Tmean
## [1] 5
SSD <- sd(Ematrix$x) ##The simulated standard deviation
SSD
## [1] 0.8158697
SVar <- var(Ematrix$x) ##The simulated variance from the above matrix.
SVar
## [1] 0.6656433
TSD <- (1/lambda)/sqrt(n) ##The theoretical standard deviation of the sample mean (i.e. the standard error).
TSD
## [1] 0.7905694
TVar <- TSD^2 ##The theoretical calculation for the Variance
TVar
## [1] 0.625
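By the CLT, the simulated means above should be approximately normal around the theoretical mean 1/lambda. A quick visual check (re-running the simulation locally, so exact numbers differ slightly from the matrix above):

```r
# Histogram of 1000 simulated exponential means with the theoretical
# normal density N(1/lambda, (1/lambda)/sqrt(n)) overlaid
lambda <- 0.2; n <- 40
set.seed(1)
means <- replicate(1000, mean(rexp(n, lambda)))
hist(means, breaks = 30, probability = TRUE, main = "Simulated means")
curve(dnorm(x, mean = 1/lambda, sd = (1/lambda)/sqrt(n)), add = TRUE, lwd = 2)
```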
(Xbar - mu0)/(s/sqrt(n)): the sample mean minus the hypothesized/test mean, divided by the standard error of the mean, i.e. the sample standard deviation s divided by the square root of the sample size n. This statistic follows a t-distribution with n-1 degrees of freedom. You can calculate the t-quantile using qt(.95, 15), where .95 is the percentile (an alpha of 0.05) and 15 is n-1 in this example.
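The t-statistic formula can be verified by hand against t.test(); the sample x below is made up purely for illustration:

```r
# One-sample t-statistic computed manually and via t.test()
set.seed(7)
x <- rnorm(16, mean = 1)                            # n = 16, so df = 15
tstat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))  # hypothesized mean of 0
tstat
t.test(x)$statistic                                 # identical value
qt(.975, df = 15)                                   # two-sided 5% cutoff
```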
library(UsingR); data(father.son)
t.test(father.son$sheight - father.son$fheight) ##testing whether there are significant differences between the two columns, one for father height and one for son's height.
##
## One Sample t-test
##
## data: father.son$sheight - father.son$fheight
## t = 11.789, df = 1077, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.8310296 1.1629160
## sample estimates:
## mean of x
## 0.9969728
library(datasets)
data(ToothGrowth)
head(ToothGrowth)
library(ggplot2)
plot0 <- ggplot(ToothGrowth, aes(supp, len, fill = supp))
plot1 <- plot0 + geom_dotplot(binaxis ="y") +
facet_grid(.~ dose)
plot1
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
t.test(len ~ supp, data = ToothGrowth) ##testing sig difference between supp types and length
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
test1 <- subset(ToothGrowth, dose %in% c(.5,1)) ##subsetting doses to make pairwise comparisons
t.test(len ~ dose, data = test1) ##testing sig differences between discrete doses and length
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means between group 0.5 and group 1 is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean in group 0.5 mean in group 1
## 10.605 19.735
test2 <- subset(ToothGrowth, dose %in% c(1,2)) ##subsetting doses to make pairwise comparisons
t.test(len ~ dose, data = test2)
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2
## 19.735 26.100
## Prompt for below: a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student's T confidence interval for the mean brain volume?
mn = 1100
s = 30
n = 9
round(mn + c(-1,1)*qt(.975, df = n - 1)*s/sqrt(n))
## [1] 1077 1123
## diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up - baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
n = 9
mn_dif = 2 ## magnitude of the average weight difference
(y_d <- round(mn_dif*sqrt(n) / qt(.975, df = 8), 2))
## [1] 2.6
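To confirm the back-calculated standard deviation, plug it into the CI formula: with sd = 2.6, the upper endpoint of the 95% interval around the mean difference of -2 should land at approximately 0:

```r
# Upper endpoint of the 95% T interval for a mean difference of -2,
# with n = 9 and the back-calculated sd of 2.6
n <- 9
-2 + qt(.975, df = n - 1) * 2.6 / sqrt(n)  # approximately 0
```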
##Running a two-sample t-test
data(ChickWeight); library(reshape2)
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight") ##dcasting the long form of the dataset into wide form: one row per Diet and Chick, one column per Time value, holding the weight.
names(wideCW)[-(1:2)] <- paste( "time", names(wideCW)[-(1:2)], sep= "") ##renames all columns except the first two, pasting "time" before the existing column header.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
wideCW <- mutate(wideCW, gain = time21-time0) ##adds a column titled gain: the difference between the weights at time 21 and time 0.
wideCW_14 <- subset(wideCW, Diet %in% c(1,4))
t.test(gain~Diet, paired = FALSE, var.equal = TRUE, data = wideCW_14)
##
## Two Sample t-test
##
## data: gain by Diet
## t = -2.7252, df = 23, p-value = 0.01207
## alternative hypothesis: true difference in means between group 1 and group 4 is not equal to 0
## 95 percent confidence interval:
## -108.14679 -14.81154
## sample estimates:
## mean in group 1 mean in group 4
## 136.1875 197.6667
pt(2.5, 15, lower.tail = FALSE) ##the probability of getting a T-statistic of 2.5 or larger (first argument), with a df of 15 (i.e. n-1).
## [1] 0.0122529
Sum[i=1 to n] of [Yi - (B0 + B1*Xi)]^2 is the least-squares criterion: the sum over the n samples of the squared difference between the i'th sample's y-coordinate Yi and the fitted value B0 + B1*Xi.
B0 = intercept of y-axis
B1 = slope
Yi = y of sample i
Xi = x of sample i
Think of it as subtracting each sample's y-coordinate from the best-fit line y = B0 + B1*x (i.e. Yi - Yfit), squaring, and summing; the line that minimizes this sum is the least-squares best fit.
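The criterion can also be minimized numerically, which is a nice way to see that lm() really is solving this least-squares problem; x and y here are small made-up vectors for illustration:

```r
# Minimizing the sum of squared residuals directly with optim()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
rss <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)
opt <- optim(c(0, 0), rss)   # beta[1] = intercept B0, beta[2] = slope B1
opt$par                      # numerically minimized (B0, B1)
coef(lm(y ~ x))              # essentially the same values
```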
library(UsingR)
library(ggplot2)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+5), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq-5))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g
###_____________________Linear Least Squares calculated
y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) * sd(y) / sd(x) ##The slope, which is the correlation between y and x times the st.dev of y over the st.dev of x
beta0 <- mean(y) - beta1 * mean(x) ##calculates the y-intercept of the best fit.
rbind(c(beta0, beta1), coef(lm(y ~ x))) ##lm stands for linear model. coef takes the output and gives us coefficients. REMEMBER - y is the outcome, and x is the predictor.
## (Intercept) x
## [1,] 23.94153 0.6462906
## [2,] 23.94153 0.6462906
##This example gives you the same output, because one is the manual calculation, while the other is done automatically using lm.
beta1 <- cor(y, x) * sd(x) / sd(y) ##if we swap the outcome and the predictor
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y))) ##the manual calculation again matches lm; note the slope changes when x and y are swapped
## (Intercept) y
## [1,] 46.13535 0.3256475
## [2,] 46.13535 0.3256475
yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc) / sum(xc ^ 2)
c(beta1, coef(lm(y ~ x))[2]) ##returns the same slope
## x
## 0.6462906 0.6462906
###simplified
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+5), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq-10))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g <- g + geom_smooth(method="lm", col ="orange", formula=y~x) ##for fitting the linear best fit to a plot
g
library(UsingR)
data(diamond)
library(ggplot2)
g = ggplot(diamond, aes(x = carat, y = price))
g = g + xlab("Mass (carats)")
g = g + ylab("Price (SIN $)")
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 4, colour = "black", alpha=0.5)
g = g + geom_point(size = 2, colour = "blue", alpha=0.2)
g
## `geom_smooth()` using formula 'y ~ x'
fit <- lm(price~ carat, data = diamond) ## prints coefficient Bo and B1, slope and intercept
coef(fit)
## (Intercept) carat
## -259.6259 3721.0249
summary(fit) ##gives you the complete statistic
##
## Call:
## lm(formula = price ~ carat, data = diamond)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.159 -21.448 -0.869 18.972 79.370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.63 17.32 -14.99 <2e-16 ***
## carat 3721.02 81.79 45.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared: 0.9783, Adjusted R-squared: 0.9778
## F-statistic: 2070 on 1 and 46 DF, p-value: < 2.2e-16
fit2 <- lm(price ~ I(carat - mean(carat)), data=diamond) ##mean-centering the predictor, carat. The I() function lets you apply a calculation inside the formula.
coef(fit2)
## (Intercept) I(carat - mean(carat))
## 500.0833 3721.0249
fit3 <- lm(price ~ I(carat * 10), data=diamond) ##rescaling the predictor: fit3 reports the price change per 1/10 carat instead of per whole carat.
coef(fit3)
## (Intercept) I(carat * 10)
## -259.6259 372.1025
newx <- c(0.16, 0.27, 0.34) ## an example of a list of carat sizes for diamonds, and we want to predict the price.
predict(fit, newdata = data.frame(carat = newx)) ##predicting with the original fit; newdata supplies the new carat values in a data.frame whose column name, carat, matches the predictor in the model.
## 1 2 3
## 335.7381 745.0508 1005.5225
data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit) ##easiest way to calculate residuals
yhat <- predict(fit)
max(abs(e -(y - yhat)))
## [1] 9.485746e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x))) ##will return the same as resid()
## [1] 9.485746e-13
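Least-squares residuals have two defining properties worth checking: they sum to zero (when the model has an intercept) and they are orthogonal to the predictor. Shown here on the built-in mtcars data so the snippet is self-contained:

```r
# Residuals sum to zero and are uncorrelated with the predictor
fit_m <- lm(mpg ~ wt, data = mtcars)
e_m <- resid(fit_m)
sum(e_m)              # ~ 0 up to floating-point error
sum(e_m * mtcars$wt)  # ~ 0: orthogonal to the predictor
```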
plot(x, e,
xlab = "Mass (carats)",
ylab = "Residuals (SIN $)",
bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
##_______________Example of non-linear data and residuals
x = runif(100, -3, 3); y = x + sin(x) + rnorm(100, sd = .2);
library(ggplot2)
library(RColorBrewer)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g
## `geom_smooth()` using formula 'y ~ x'
##_______________
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), ##How to plot residuals
aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2);
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g
##____________ Calculating the residual and plotting for diamond data
diamond$e <- resid(lm(price ~ carat, data = diamond)) ##adding a new column into data.frame for residuals
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g
fit <- lm(mpg~wt, data= mtcars) ## creating a linear model of the mtcars dataset, where weight is the predictor
predict(fit, wt=3000) ##note: wt=3000 is not a valid argument to predict() and is silently ignored, so this returns the fitted values for every car. To predict at a specific weight, use predict(fit, newdata = data.frame(wt = 3)), since wt is measured in 1000s of lbs.
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 23.282611 21.919770 24.885952 20.102650
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 18.900144 18.793255 18.205363 20.236262
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 20.450041 18.900144 18.900144 15.533127
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 17.350247 17.083024 9.226650 8.296712
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 8.718926 25.527289 28.653805 27.478021
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 24.111004 18.472586 18.926866 16.762355
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 16.735633 26.943574 25.847957 29.198941
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 20.343151 22.480940 18.205363 22.427495
data <- rnorm(100)
#perform Shapiro-Wilk test for normality
shapiro.test(data) ## we reject normality when p < 0.05; otherwise we fail to reject, i.e. the data are consistent with a normal distribution.
##
## Shapiro-Wilk normality test
##
## data: data
## W = 0.99441, p-value = 0.9567
Components of a predictor: question - input data - features - algorithm - parameters - evaluation
Prediction has accuracy tradeoffs
In-sample error: the error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.
Out-of-sample error: the error rate you get on a new data set. Sometimes called generalization error.
Out-of-sample error is what we should care about. In-sample error is always less than out-of-sample error; the reason is over-fitting.
True positive vs. false positive, true negative vs. false negative:
- Sensitivity: the probability that the test predicts positive when the outcome really is positive
- Specificity: the probability that the test predicts negative when the outcome really is negative
Receiver Operating Characteristic (ROC): the area under the curve (AUC) measures goodness of fit of the model/prediction to the data/outcome. An AUC of 0.5 is random guessing; an AUC > 0.8 indicates a good model.
Cross-validation:
1.) Use the training set
2.) Split the training set into a test set and a smaller training set
3.) Build a model on the smaller training set
4.) Evaluate the model on the test set
5.) Repeat and average the estimated errors
k-fold cross-validation is a popular technique.
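The cross-validation recipe above can be sketched in base R without caret; this minimal version estimates out-of-sample mean squared error for a linear model on the built-in mtcars data (dataset and model chosen just for illustration):

```r
# Minimal k-fold cross-validation in base R
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels
errs <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt, data = train)      # build the model on k-1 folds
  mean((test$mpg - predict(fit, test))^2)  # error on the held-out fold
})
mean(errs)  # averaged estimate of out-of-sample error
```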
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
data(spam)
head(spam)
plot(density(spam$your[spam$type=="nonspam"]),
col="blue",main="",xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]),col="red")
abline(v=0.5,col="black")
##We want to find a value C, which is the cutoff point between spam frequencies, and non-spam frequencies of the word "your". If frequency is above C, predict spam...
prediction <- ifelse(spam$your > 0.5,"spam","nonspam")
table(prediction,spam$type)/length(spam$type)
##
## prediction nonspam spam
## nonspam 0.4590306 0.1017170
## spam 0.1469246 0.2923278
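The overall accuracy of such a table is the sum of its diagonal (the two cells where prediction and truth agree). A sketch with a made-up table in the same layout:

```r
# Accuracy = sum of the diagonal of a (proportion) confusion table
tab <- matrix(c(0.459, 0.147, 0.102, 0.292), nrow = 2,
              dimnames = list(prediction = c("nonspam", "spam"),
                              truth = c("nonspam", "spam")))
sum(diag(tab))  # fraction classified correctly, 0.751 here
```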
###Caret for machine learning package
obj class | Package | Predict function syntax
--- | --- | ---
lda | MASS | predict(obj) (no options needed)
glm | stats | predict(obj, type = "response")
gbm | gbm | predict(obj, type = "response", n.trees)
mda | mda | predict(obj, type = "posterior")
rpart | rpart | predict(obj, type = "prob")
Weka | RWeka | predict(obj, type = "probability")
LogitBoost | caTools | predict(obj, type = "raw", nIter)

** Different types of predictors have different predict() syntax requirements.
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(kernlab) ## to get spam dataset
data(spam)
inTrain <- createDataPartition(y=spam$type, p= 0.75, list = FALSE) ## data is partitioned by spam type, where p is the training proportion.
training <- spam[inTrain,] ##rows selected by the partition form the training set
testing <- spam[-inTrain,] ##rows not selected form the test set
dim(training) ##shows the dimensions of the dataframe
## [1] 3451 58
set.seed(11111)
modelFit <- train(type ~., data= training, method = "glm") ##training a model, specifically a glm model, where we are comparing column type (spam vs. not spam) with everything ".", and using the subsetted data for training.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9213441 0.8350731
modelFit$finalModel ##returns the fitted coefficients for each predictor column. Larger positive coefficients push a message toward being classified as spam.
##
## Call: NULL
##
## Coefficients:
## (Intercept) make address all
## -1.453e+00 -2.097e-01 -1.617e-01 1.525e-01
## num3d our over remove
## 2.070e+00 5.582e-01 6.049e-01 2.195e+00
## internet order mail receive
## 6.044e-01 8.605e-01 1.361e-01 -4.118e-01
## will people report addresses
## -1.976e-01 6.807e-02 6.363e-02 1.047e+00
## free business email you
## 1.236e+00 9.854e-01 -4.050e-02 1.147e-01
## credit your font num000
## 8.911e-01 2.028e-01 2.563e-01 1.770e+00
## money hp hpl george
## 2.007e-01 -2.089e+00 -1.015e+00 -7.784e+00
## num650 lab labs telnet
## 3.942e-01 -1.989e+00 -4.778e-01 -4.912e+00
## num857 data num415 num85
## 1.525e+00 -1.247e+00 1.243e+00 -2.569e+00
## technology num1999 parts pm
## 1.265e+00 2.640e-01 1.668e+00 -8.041e-01
## direct cs meeting original
## -3.565e-01 -5.148e+02 -2.555e+00 -1.211e+00
## project re edu table
## -2.319e+00 -1.069e+00 -1.573e+00 -2.919e+00
## conference charSemicolon charRoundbracket charSquarebracket
## -4.012e+00 -1.247e+00 -2.209e-01 -6.101e-01
## charExclamation charDollar charHash capitalAve
## 2.468e-01 6.951e+00 1.509e+00 -1.323e-02
## capitalLong capitalTotal
## 9.523e-03 7.914e-04
##
## Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
## Null Deviance: 4628
## Residual Deviance: 1323 AIC: 1439
predictions <- predict(modelFit, newdata = testing) ##how we can predict on new samples, here using the subset testing data.frame.
predictions
## [1] spam nonspam nonspam spam spam spam nonspam nonspam spam
## [10] spam spam spam spam spam spam spam spam spam
## [19] spam spam spam spam spam spam spam spam spam
## [28] spam spam spam nonspam spam spam nonspam spam spam
## [37] nonspam spam spam spam spam spam spam spam spam
## [46] spam spam spam spam spam spam spam nonspam spam
## [55] spam spam spam spam spam spam spam nonspam spam
## [64] spam nonspam spam nonspam spam spam spam spam spam
## [73] spam spam spam spam spam spam spam spam spam
## [82] spam nonspam spam nonspam spam spam spam nonspam spam
## [91] spam spam spam spam spam spam nonspam spam spam
## [100] spam spam spam spam spam spam spam spam spam
## [109] spam spam spam spam spam spam spam spam spam
## [118] spam spam nonspam spam spam spam spam spam spam
## [127] spam spam spam spam spam spam spam spam spam
## [136] spam spam nonspam spam spam spam spam spam spam
## [145] spam spam spam spam spam spam spam spam spam
## [154] spam nonspam spam spam spam spam spam spam spam
## [163] spam spam spam spam spam spam spam spam spam
## [172] spam spam spam spam spam spam spam spam spam
## [181] spam spam nonspam nonspam spam spam spam spam spam
## [190] spam spam spam spam spam spam spam spam spam
## [199] spam spam spam spam spam spam spam spam spam
## [208] spam spam spam spam spam spam spam spam nonspam
## [217] spam spam spam nonspam spam spam spam spam spam
## [226] spam spam spam spam spam spam spam spam spam
## [235] spam nonspam nonspam spam spam spam nonspam spam spam
## [244] spam spam spam spam spam spam spam spam spam
## [253] nonspam spam spam spam spam spam spam spam spam
## [262] spam spam spam spam spam spam spam spam spam
## [271] spam spam spam spam spam spam spam spam spam
## [280] spam spam spam spam spam nonspam spam spam spam
## [289] spam spam spam spam spam spam spam spam spam
## [298] spam spam spam spam spam spam spam nonspam spam
## [307] spam spam spam spam spam nonspam spam spam spam
## [316] spam nonspam spam spam spam spam spam spam spam
## [325] nonspam spam nonspam spam spam spam spam spam spam
## [334] spam spam spam nonspam spam nonspam spam spam spam
## [343] spam spam spam spam spam spam spam spam spam
## [352] spam spam spam spam spam spam nonspam spam spam
## [361] spam spam spam spam nonspam nonspam spam spam spam
## [370] spam spam spam spam nonspam spam spam spam spam
## [379] spam spam nonspam nonspam spam spam spam spam spam
## [388] spam spam spam spam spam nonspam spam spam spam
## [397] spam spam spam nonspam spam spam nonspam spam spam
## [406] spam spam spam spam spam spam spam spam spam
## [415] spam spam nonspam spam spam spam spam nonspam spam
## [424] spam nonspam spam nonspam spam spam spam spam spam
## [433] nonspam spam nonspam nonspam spam spam nonspam spam spam
## [442] spam spam spam spam spam spam spam spam spam
## [451] spam spam spam nonspam nonspam nonspam nonspam spam nonspam
## [460] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [469] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [478] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [487] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [496] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [505] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [514] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [523] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [532] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [541] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [550] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [559] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [568] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [577] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [586] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [595] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [604] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [613] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [622] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [631] nonspam spam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [640] nonspam nonspam spam nonspam nonspam nonspam nonspam spam nonspam
## [649] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [658] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [667] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
## [676] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [685] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
## [694] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [703] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [712] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [721] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [730] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [739] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [748] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
## [757] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [766] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [775] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [784] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [793] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [802] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [811] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
## [820] spam nonspam nonspam nonspam nonspam spam spam spam nonspam
## [829] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [838] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [847] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [856] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [865] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [874] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [883] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [892] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [901] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [910] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [919] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [928] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [937] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [946] spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [955] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [964] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [973] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [982] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [991] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1000] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1009] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1018] nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam spam
## [1027] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1036] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1045] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1054] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1063] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1072] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [1081] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1090] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [1099] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
## [1108] nonspam spam spam nonspam nonspam nonspam nonspam nonspam nonspam
## [1117] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1126] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1135] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1144] nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## Levels: nonspam spam
confusionMatrix(predictions, testing$type) ##using the confusionMatrix function to evaluate how well the predictions match the true type (spam vs. nonspam) in the test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 666 51
## spam 31 402
##
## Accuracy : 0.9287
## 95% CI : (0.9123, 0.9429)
## No Information Rate : 0.6061
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8495
##
## Mcnemar's Test P-Value : 0.03589
##
## Sensitivity : 0.9555
## Specificity : 0.8874
## Pos Pred Value : 0.9289
## Neg Pred Value : 0.9284
## Prevalence : 0.6061
## Detection Rate : 0.5791
## Detection Prevalence : 0.6235
## Balanced Accuracy : 0.9215
##
## 'Positive' Class : nonspam
##
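As a sanity check on the statistics above, sensitivity, specificity, and accuracy can be recomputed by hand from the 2x2 table (counts copied from the confusion matrix output; `nonspam` is the positive class):

```r
tab <- matrix(c(666, 31, 51, 402), nrow = 2,
              dimnames = list(Prediction = c("nonspam", "spam"),
                              Reference  = c("nonspam", "spam")))
sens <- tab["nonspam", "nonspam"] / sum(tab[, "nonspam"])  # TP / (TP + FN)
spec <- tab["spam", "spam"]       / sum(tab[, "spam"])     # TN / (TN + FP)
acc  <- sum(diag(tab)) / sum(tab)                          # correct / total
round(c(sensitivity = sens, specificity = spec, accuracy = acc), 4)
## sensitivity specificity    accuracy
##      0.9555      0.8874      0.9287
```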
library(caret)
library(kernlab) ## to get spam dataset
data(spam)
inTrain <- createDataPartition(y=spam$type, p= 0.75, list = FALSE) ## data is partitioned by spam type, where p is the training proportion.
training <- spam[inTrain,] ##rows selected by the partition become the training set
testing <- spam[-inTrain,] ##all remaining rows become the test set
dim(training)
## [1] 3451 58
set.seed(11111)
folds <- createFolds(y= spam$type, k=10, list= TRUE, returnTrain=TRUE) ##k is the number of folds; spam$type is the column being referenced. You can return either the training set or the test set.
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 4141 4140 4141 4142 4142 4141 4140 4140 4141 4141
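With `returnTrain=TRUE`, each list element holds the training row indices for that fold, and the held-out rows are simply everything else. A sketch of the cross-validation loop those folds enable, using base-R folds on a built-in dataset so it runs without caret (dataset and model are stand-ins):

```r
set.seed(11111)
n <- nrow(mtcars)
foldId <- sample(rep(1:5, length.out = n))   # assign each row to one of 5 folds
rmse <- sapply(1:5, function(k) {
  trainDat <- mtcars[foldId != k, ]          # analogue of folds[[k]] with returnTrain = TRUE
  testDat  <- mtcars[foldId == k, ]          # the held-out fold
  fit <- lm(mpg ~ wt, data = trainDat)
  sqrt(mean((predict(fit, testDat) - testDat$mpg)^2))
})
round(mean(rmse), 2)   # cross-validated RMSE estimate
```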
library(ISLR)
library(ggplot2)
library(caret)
data(Wage) ## the ISLR dataset name is capitalized: Wage
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
head(Wage, n=20)
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training); dim(testing) ##returns the number of rows and columns
## [1] 2102 11
## [1] 898 11
featurePlot(x=training[, c("age", "education", "jobclass")], ##a plotting function that comes with caret, and we're taking a subset of the data, three columns in this instance.
y= training$wage, ##this is your output of interest
plot="pairs")
qplot(age, wage, data=training) ##here we see two distinct groups
qplot(age, wage, colour= jobclass, data= training)
qp <- qplot(age, wage, color= education, data=training)
qp + geom_smooth(method = 'lm', formula= y~x)
library(Hmisc) ##good library for cutting data.frames into pieces.
cutWage <- cut2(training$wage, g=3) ##the cut2() function cuts the wage column into g=3 groups with roughly equal numbers of observations.
table(cutWage)
## cutWage
## [ 20.9, 93) [ 93.0,119) [118.9,318]
## 707 718 677
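Without Hmisc, base R's `cut()` with quantile breaks gives approximately the same equal-sized groups (shown on a built-in dataset as a stand-in; `cut2` handles tied boundary values slightly differently):

```r
breaks  <- quantile(cars$dist, probs = seq(0, 1, length.out = 4))  # tertile boundaries
cutDist <- cut(cars$dist, breaks = breaks, include.lowest = TRUE)  # 3 roughly equal groups
table(cutDist)
```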
p1 <- qplot(cutWage, age, data=training, fill=cutWage, geom=c("boxplot"))
p1
t1 <- table(cutWage, training$jobclass) ##great way to make quick tables
t1
##
## cutWage 1. Industrial 2. Information
## [ 20.9, 93) 455 252
## [ 93.0,119) 362 356
## [118.9,318] 276 401
prop.table(t1,1) ##proportion table; margin 1 gives proportions within each row, margin 2 would give proportions within each column.
##
## cutWage 1. Industrial 2. Information
## [ 20.9, 93) 0.6435644 0.3564356
## [ 93.0,119) 0.5041783 0.4958217
## [118.9,318] 0.4076809 0.5923191
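As a check, those row proportions can be recomputed from the counts in the first table (values copied from the `t1` output above):

```r
t1 <- matrix(c(455, 362, 276, 252, 356, 401), nrow = 3,
             dimnames = list(cutWage  = c("[ 20.9, 93)", "[ 93.0,119)", "[118.9,318]"),
                             jobclass = c("1. Industrial", "2. Information")))
p <- prop.table(t1, 1)   # margin = 1: each row sums to 1
round(p, 7)
```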
qplot(wage, color=education, data= training, geom="density")
library(caret)
library(kernlab) ## to get spam dataset
library(RANN)
data(spam)
inTrain <- createDataPartition(y=spam$type, p= 0.75, list = FALSE) ## data is partitioned by spam type, where p is the training proportion.
training <- spam[inTrain,] ##subsetting the output of the partition function by spam
testing <- spam[-inTrain,] ##subsetting the output of the partition function by not spam
dim(training)
## [1] 3451 58
hist(training$capitalAve, main="", xlab="ave.capital run length")
mean(training$capitalAve)
## [1] 5.577148
sd(training$capitalAve) ## what we see here is that the standard deviation is significantly higher than the mean, which indicates that we need some pre-processing.
## [1] 35.92989
trainCapAve <- training$capitalAve
trainCapAveS <- (trainCapAve - mean(trainCapAve))/sd(trainCapAve) ## A way of standardizing the data
mean(trainCapAveS)
## [1] -1.162129e-17
sd(trainCapAveS)
## [1] 1
##If we want to then standardize the test set, we must use the mean from the training set, and the st.dev from the training set.
testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve - mean(trainCapAve))/sd(trainCapAve)
mean(testCapAveS)
## [1] -0.04294102
sd(testCapAveS)
## [1] 0.3437084
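The point of reusing the training mean and standard deviation is that the standardized test set then comes out only approximately centered and scaled, exactly as seen above. A self-contained toy version of the idea (the skewed distribution and sample sizes are made up):

```r
set.seed(1)
train <- rexp(200, rate = 0.2)   # stand-in for training$capitalAve (skewed data)
test  <- rexp(100, rate = 0.2)   # stand-in for testing$capitalAve
mu <- mean(train); s <- sd(train)
trainS <- (train - mu) / s       # exactly mean 0, sd 1 by construction
testS  <- (test  - mu) / s       # only roughly mean 0, sd 1, since it reuses train's statistics
round(c(mean(trainS), sd(trainS), mean(testS), sd(testS)), 3)
```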
##Alternative: pass a preProcess argument to the train() function
modelFit <- train(type~., data= training, preProcess= c("center", "scale"), method="glm") ##This centers and scales all of the predictor variables.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9167869 0.8257754
## Prediction models tend to fail with NAs. We'll use k-nearest-neighbors imputation to fill them in.
# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1 ## we are adding NA's into the dataset here.
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute") ##excluding column 58, the outcome (type)
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
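knnImpute finds, for each row with a missing value, the k rows most similar on the other predictors and averages their observed values. A toy base-R version of the idea, on made-up data (the single predictor `x`, the choice k = 5, and the knocked-out rows are all hypothetical):

```r
set.seed(333)
x <- sort(runif(60))                 # the other predictor, used to find neighbours
y <- 2 * x + rnorm(60, sd = 0.1)     # variable we will partially knock out
y_true <- y
naIdx <- c(5, 25, 50)
y[naIdx] <- NA                       # introduce missing values
k <- 5
for (i in naIdx) {
  nn <- order(abs(x - x[i]))         # rows ranked by closeness in x
  nn <- setdiff(nn, naIdx)[1:k]      # drop rows that are themselves missing
  y[i] <- mean(y[nn])                # impute with the neighbour average
}
max(abs(y[naIdx] - y_true[naIdx]))   # imputation error on the knocked-out rows
```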
##think about creating or calculating variables/features to summarize the sample data, in this instance an entire e-mail if we use the spam dataset.
library(ISLR); library(caret); data(Wage);
inTrain <- createDataPartition(y=Wage$wage,
p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
##turning qualitative variables into quantitative ones
table(training$jobclass)
##
## 1. Industrial 2. Information
## 1067 1035
dummies <- dummyVars(wage ~ jobclass,data=training) ##the dummyVars() function turns categorical info, such as "Industrial" or "Information", into indicator columns. Outcome is wage, and jobclass is the predictor.
head(predict(dummies,newdata=training))
## jobclass.1. Industrial jobclass.2. Information
## 161300 1 0
## 155159 0 1
## 376662 0 1
## 450601 1 0
## 377954 0 1
## 228963 0 1
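Base R's `model.matrix()` produces the same kind of indicator columns; a self-contained sketch with a made-up two-level factor:

```r
df <- data.frame(jobclass = factor(c("Industrial", "Information",
                                     "Information", "Industrial")))
# ~ jobclass - 1 drops the intercept, giving one 0/1 column per level, like dummyVars()
m <- model.matrix(~ jobclass - 1, data = df)
m
```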
##__________________
##A way to throw out less-meaningful predictors, i.e. near-zero variations in the data.
nsv <- nearZeroVar(training,saveMetrics=TRUE)
nsv ##in this example region has near-zero variance and can be thrown out
##Using the old faithful dataset
library(caret);data(faithful); set.seed(333)
inTrain <- createDataPartition(y=faithful$waiting,
p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
head(trainFaith)
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration") ##plotting the training data.
lm1 <- lm(eruptions ~ waiting, data=trainFaith)
summary(lm1)
##
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13375 -0.36778 0.06064 0.36578 0.96057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.648629 0.226603 -7.275 2.55e-11 ***
## waiting 0.072211 0.003136 23.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4941 on 135 degrees of freedom
## Multiple R-squared: 0.7971, Adjusted R-squared: 0.7956
## F-statistic: 530.2 on 1 and 135 DF, p-value: < 2.2e-16
##output is y(eruption duration) = 0.0722 * (waiting time) - 1.6486
## To predict a new value, we can automate this by:
coef(lm1)[1] + coef(lm1)[2]*80 ## coef(lm1)[1] returns the intercept, and [2] returns the slope. We are predicting with a wait time of 80
## (Intercept)
## 4.128276
newdata <- data.frame(waiting=80)
predict(lm1,newdata) ## a shortcut so we don't have to manually combine the coefficients.
## 1
## 4.128276
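The manual calculation and `predict()` are interchangeable; a self-contained check on a built-in dataset (the cars data is a stand-in for the faithful example above):

```r
fit    <- lm(dist ~ speed, data = cars)
manual <- coef(fit)[1] + coef(fit)[2] * 21           # intercept + slope * new value
auto   <- predict(fit, newdata = data.frame(speed = 21))
all.equal(unname(manual), unname(auto))              # the two agree
```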
##We can now use the model fit on the training set to make predictions on the TEST set
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
ord <- order(testFaith$waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
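The matrix returned by `interval = "prediction"` has columns fit, lwr, and upr, which can be overlaid on the plot with `matlines()`. A self-contained version on a built-in dataset (cars is a stand-in for testFaith):

```r
fit  <- lm(dist ~ speed, data = cars)
pred <- predict(fit, newdata = cars, interval = "prediction")  # columns: fit, lwr, upr
ord  <- order(cars$speed)                                      # sort so the lines draw cleanly
plot(cars$speed, cars$dist, pch = 19, col = "blue",
     xlab = "Speed", ylab = "Stopping distance")
matlines(cars$speed[ord], pred[ord, ], type = "l",
         lty = c(1, 2, 2), col = c("black", "red", "red"))     # fitted line plus interval bounds
```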
##________________
##We can use CARET to do the same, and much faster
modFit <- train(eruptions ~ waiting,data=trainFaith,method="lm") ##eruptions is outcome, waiting is predictor. model built on the training dataset, and method is linear modeling.
summary(modFit$finalModel) ##How we get final model output
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13375 -0.36778 0.06064 0.36578 0.96057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.648629 0.226603 -7.275 2.55e-11 ***
## waiting 0.072211 0.003136 23.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4941 on 135 degrees of freedom
## Multiple R-squared: 0.7971, Adjusted R-squared: 0.7956
## F-statistic: 530.2 on 1 and 135 DF, p-value: < 2.2e-16
library(ISLR); library(ggplot2); library(caret);
data(Wage); Wage <- subset(Wage,select=-c(logwage)) ##removing logwage, a log-transformed copy of the outcome (wage) we are trying to predict
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins wage
## 1. <=Good : 858 1. Yes:2083 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.: 85.38
## Median :104.92
## Mean :111.70
## 3rd Qu.:128.68
## Max. :318.34
##
inTrain <- createDataPartition(y=Wage$wage,##subsetting into test and train
p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training); dim(testing)
## [1] 2102 10
## [1] 898 10
featurePlot(x=training[,c("age","education","jobclass")],
y = training$wage,
plot="pairs")
##________________
modFit<- train(wage ~ age + jobclass + education,
method = "lm",data=training)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression
##
## 2102 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 35.56759 0.2589245 24.87554
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
plot(finMod,1,pch=19,cex=0.5,col="#00000010") ##residuals vs. fitted. Ideally the points scatter evenly around a horizontal line at zero.
qplot(finMod$fitted,finMod$residuals,colour=race,data=training) ##also fitted vs. residuals. Trying to identify different trends here.
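The same residuals-vs-fitted diagnostic can be drawn in base R for any lm object; with a well-specified model the points show no pattern around the zero line (built-in dataset as a stand-in for the wage model):

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in model
plot(fitted(fit), resid(fit), pch = 19,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                 # residuals should straddle this line with no trend
mean(resid(fit))                       # essentially zero for any lm with an intercept
```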